library(topGO)
library(knitr)   # to use kable()
library(limma)   # to use vennDiagram()
library(ggplot2)
TAG <- params$TAG
VAR <- params$VAR
ANNOTATION <- list(genes = '../2019-07-26/genes/annotation.txt',
                   isoforms = '../2019-07-26/transcripts/annotation.txt')
TAGDIR <- paste0('../2020-01-08/', TAG, '/')

Introduction

This is the enrichment analysis for genes regulated by hatching. Because I quantified the association of expression with hatching in two different ways, this document has two sections. Section Variance reports the enrichment analysis performed with genes ordered by the proportion of expression variance explained by hatching. Note that some genes with low variance may still be highly associated with hatching, even if the fold change between levels of this factor is low. Section Differential expression uses an ordering of genes based on the significance of the differential expression between levels of hatching, which does depend on fold change.

Reading the data

Functional annotation is in 2019-07-26. I will also upload two lists of genes, with either proportion of variance explained by hatching or p-value of differential expression test.

tagVariance <- read.table(paste0(TAGDIR, VAR, '_variance.txt'))
tagPValue   <- read.table(paste0(TAGDIR, VAR, '_pvalue.txt'))
annotation  <- read.table(ANNOTATION[[TAG]], col.names = c('tagname', 'goterms'))

To initialize the topGOdata object, I will need the gene list as a named numeric vector, where the names are the genes identifiers and the numeric values, either the portion of variance explained by hatching or the p-values of the differential expression test. The structure() function adds attributes to an object.

Variance <- structure(tagVariance[,1], names = row.names(tagVariance))
PValues  <- structure(tagPValue[,1],   names = row.names(tagPValue))
rm(tagVariance, tagPValue)

There are 18598 genes scored with a variance portion and a differential expression p-value. It should actually be the exact same genes. The annotation data frame has more than one GO term for every tag, separated by |. I need a named list.

head(annotation)
##       tagname                                                           goterms
## 1 XLOC_000002                                  GO:0008417|GO:0016020|GO:0006486
## 2 XLOC_000009                                  GO:0043130|GO:0005515|GO:0043161
## 3 XLOC_000010                       GO:0005840|GO:0015935|GO:0003735|GO:0006412
## 4 XLOC_000015 GO:0006355|GO:0003700|GO:0006357|GO:0043565|GO:0005634|GO:0032502
## 5 XLOC_000021                       GO:0016567|GO:0006397|GO:0061630|GO:0008270
## 6 XLOC_000036                                  GO:0005272|GO:0006814|GO:0016020
allgenes2GO <- strsplit(as.character(annotation$goterms), "|", fixed = TRUE)
names(allgenes2GO) <- annotation$tagname
rm(annotation)

There are 9691 genes with GO annotations. But the differential expression analysis includes many more genes. In order to include the not-annotated genes in the enrichment analysis, to see if annotated or not annotated genes are more or less often differentially expressed, I should attribute a GO term to them. According to [http://geneontology.org/docs/faq/] nowadays we express lack of annotation by annotating to the root nodes, i.e. GO:0008150 biological_process, GO:0003674 molecular_function, and GO:0005575 cellular_component.

for (tag in unique(c(names(PValues), names(Variance)))) {
   if (! tag %in% names(allgenes2GO)) {
      allgenes2GO <- append(allgenes2GO,
         structure(list(c("GO:0008150", "GO:0003674", "GO:0005575")), names = tag))
   }
}

Using differential expression p-values

Building the topGO object

Creation of a topGO dataset is documented in section 4 of topGO’s the user manual: https://bioconductor.org/packages/release/bioc/vignettes/topGO/inst/doc/topGO.pdf. I need to use the new command, and fill up the slots. The annot object must be a function that compiles “a list of GO terms such that each element in the list is a character vector containing all the gene identifiers that are mapped to the respective GO term.” There are several options, that you can check using help(annFUN.gene2GO), for example. The annFUN.gene2GO requires a gene2GO argument, which is the list of gene-to-GO terms I made before. The geneSelectionFun object is a function that selects the interesting (most significant) genes. It is required to perform Fisher’s exact test. The nodeSize is used to prune the GO hierarchy from the terms which have less than n annotated genes.

I generate three datasets, to analyse each of the three ontologies.

selection <- function(allScores) {return(allScores < 0.05)}
GOdata.BP <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'BP', 
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdata.MF <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'MF',
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdata.CC <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'CC',
   allGenes = PValues,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
   Num_Genes = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), numGenes),
   Num_GO_terms = sapply(list(GOdata.BP, GOdata.MF, GOdata.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
ontology Num_Genes Num_GO_terms
BP 15162 1154
MF 17429 576
CC 12940 291

Running the tests

There are more than one way to test for enrichment. Something that took me a while to understand is that not only there are different test statistics (Fisher’s exact test, Kolmogorov-Smirnov, and others) but also different algorithms: classic, elim, weight… The algorithms are ways to deal with the dependence structure among GO terms due to topology. Some algorithms are compatible with all statistics implemented in topGO. But the weight and the parentchild algorithms are only applicable to Fisher’s exact test. I am not interested in the classic algorithm, which treats GO nodes as independent, and therefore produces an excess of significant results. I will not use the Fisher’s exact test, because it dependes on an arbitrary threshold of significance on non-adjusted p-values.

BP.elim     <- runTest(GOdata.BP, algorithm = "elim",     statistic = "ks")
BP.weight01 <- runTest(GOdata.BP, algorithm = "weight01", statistic = "ks")
BP.lea      <- runTest(GOdata.BP, algorithm = "lea",      statistic = "ks")
MF.elim     <- runTest(GOdata.MF, algorithm = "elim",     statistic = "ks")
MF.weight01 <- runTest(GOdata.MF, algorithm = "weight01", statistic = "ks")
MF.lea      <- runTest(GOdata.MF, algorithm = "lea",      statistic = "ks")
CC.elim     <- runTest(GOdata.CC, algorithm = "elim",     statistic = "ks")
CC.weight01 <- runTest(GOdata.CC, algorithm = "weight01", statistic = "ks")
CC.lea      <- runTest(GOdata.CC, algorithm = "lea",      statistic = "ks")

ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
   algorithm = rep(c("elim", "weight01", "lea"), 3),
   TermsTested = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) length(score(X))),
   Significant = sapply(list(BP.elim, BP.weight01, BP.lea, MF.elim, MF.weight01, MF.lea, CC.elim, CC.weight01, CC.lea), function(X) sum(score(X) < 0.01)))

kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
ontology algorithm TermsTested Significant
BP elim 1154 23
BP weight01 1154 12
BP lea 1154 45
MF elim 576 13
MF weight01 576 14
MF lea 576 25
CC elim 291 6
CC weight01 291 4
CC lea 291 9
rm(ResultsSummary)

Results

The topGO package does not pay much attention to what terms are significant because they are overrepresented and which ones are underrepresented among the most differentially expressed genes. I think it’s worth separating them, to facilitate the biological interpretation. Note that not all terms listed in the tables below are significant. The scores for the three methods (elim, weight01 and lea) are non-corrected p-values.

Biological process

orderedTerms <- names(sort(score(BP.weight01)))
significant.weight01 <- score(BP.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(BP.lea)[orderedTerms] <= 0.01
significant.elim     <- score(BP.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.pvalue.sigTerms <- sigTerms

BP.all <- GenTable(GOdata.BP, elim=BP.elim, weight01=BP.weight01, lea=BP.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   BP.all[BP.all$Significant > BP.all$Expected,],
   caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
3 GO:0006486 protein glycosylation 75 5 2.90 4 0.0018 0.0015 0.00177
4 GO:1901264 carbohydrate derivative transport 11 2 0.43 212 0.1233 0.0029 0.12333
5 GO:0015931 nucleobase-containing compound transport 29 2 1.12 498 0.4205 0.0029 0.42046
6 GO:0034314 Arp2/3 complex-mediated actin nucleation 12 2 0.46 8 0.0031 0.0031 0.00315
8 GO:0072330 monocarboxylic acid biosynthetic process 24 2 0.93 595 0.5189 0.0044 0.51887
11 GO:0009968 negative regulation of signal transducti… 9 1 0.35 9 0.0036 0.0064 0.00365
13 GO:0003341 cilium movement 10 1 0.39 30 0.0110 0.0110 0.01100
14 GO:0001522 pseudouridine synthesis 13 1 0.50 34 0.0120 0.0120 0.01201
15 GO:0035435 phosphate ion transmembrane transport 8 1 0.31 35 0.0123 0.0123 0.01230
16 GO:0044341 sodium-dependent phosphate transport 8 1 0.31 36 0.0123 0.0123 0.01230
23 GO:0006030 chitin metabolic process 74 3 2.86 224 0.1318 0.0191 0.13182
kable(
   BP.all[BP.all$Significant < BP.all$Expected,],
   caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0060271 cilium assembly 46 1 1.78 1 8e-07 6.9e-06 8e-07
2 GO:0046907 intracellular transport 202 6 7.81 439 0.3577 0.0014 0.35774
7 GO:0007264 small GTPase mediated signal transductio… 81 2 3.13 10 0.0039 0.0032 0.00393
9 GO:0043604 amide biosynthetic process 215 5 8.31 696 0.6011 0.0044 0.47885
10 GO:0006418 tRNA aminoacylation for protein translat… 40 0 1.55 13 0.0044 0.0044 0.00445
12 GO:0016579 protein deubiquitination 40 1 1.55 17 0.0064 0.0064 0.00642
17 GO:0006511 ubiquitin-dependent protein catabolic pr… 102 2 3.94 714 0.6260 0.0135 0.62602
18 GO:0006012 galactose metabolic process 6 0 0.23 39 0.0143 0.0143 0.01430
19 GO:0098813 nuclear chromosome segregation 27 1 1.04 436 0.3523 0.0175 0.35227
20 GO:0035556 intracellular signal transduction 173 6 6.69 23 0.0096 0.0182 0.00022
21 GO:1901135 carbohydrate derivative metabolic proces… 300 11 11.59 370 0.2625 0.0190 0.45764
22 GO:0019637 organophosphate metabolic process 204 6 7.88 511 0.4352 0.0191 0.68997
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term = Term(GOTERM[sigTerms]),
                 Definition = Definition(GOTERM[sigTerms]),
                 PValue=score(BP.weight01)[sigTerms]),
      caption = paste('Biological process terms significantly associated with',
                      VAR, 'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0060271 cilium assembly The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. 0.0000069
GO:0006486 protein glycosylation A protein modification process that results in the addition of a carbohydrate or carbohydrate derivative unit to a protein amino acid, e.g. the addition of glycan chains to proteins. 0.0015156
GO:0034314 Arp2/3 complex-mediated actin nucleation The actin nucleation process in which actin monomers combine to form a new branch on the side of an existing actin filament; mediated by the Arp2/3 protein complex and its interaction with other proteins. 0.0031481
GO:0007264 small GTPase mediated signal transduction Any series of molecular signals in which a small monomeric GTPase relays one or more of the signals. 0.0032176
GO:0006418 tRNA aminoacylation for protein translation The synthesis of aminoacyl tRNA by the formation of an ester bond between the 3’-hydroxyl group of the most 3’ adenosine of the tRNA and the alpha carboxylic acid group of an amino acid, to be used in ribosome-mediated polypeptide synthesis. 0.0044477
GO:0009968 negative regulation of signal transduction Any process that stops, prevents, or reduces the frequency, rate or extent of signal transduction. 0.0063763
GO:0016579 protein deubiquitination The removal of one or more ubiquitin groups from a protein. 0.0064217

I think the GO graph is useful to see the relationship among the significant terms. But too large graphs are impossible to read. I don’t know how to split the graph in meaningful subgraphs.

showSigOfNodes(GOdata.BP, score(BP.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 186 
## Number of Edges = 347 
## 
## $complete.dag
## [1] "A graph with 186 nodes."

This is just a example of the most significant GO term:

showGroupDensity(GOdata.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function

orderedTerms <- names(sort(score(MF.weight01)))
significant.weight01 <- score(MF.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(MF.lea)[orderedTerms] <= 0.01
significant.elim     <- score(MF.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.pvalue.sigTerms <- sigTerms

MF.all <- GenTable(GOdata.MF, elim=MF.elim, weight01=MF.weight01, lea=MF.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   MF.all[MF.all$Significant > MF.all$Expected,],
   caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
2 GO:0008417 fucosyltransferase activity 33 4 1.24 3 0.00076 0.00076 0.00076
6 GO:0016757 transferase activity, transferring glyco… 194 10 7.29 2 0.00039 0.00259 1.2e-06
7 GO:1901505 carbohydrate derivative transmembrane tr… 22 2 0.83 288 0.45210 0.00293 0.45210
10 GO:0004842 ubiquitin-protein transferase activity 76 3 2.86 150 0.18724 0.00678 0.18724
11 GO:0005085 guanyl-nucleotide exchange factor activi… 58 3 2.18 60 0.04614 0.00722 0.04614
12 GO:0008061 chitin binding 77 5 2.89 9 0.00757 0.00757 0.00757
kable(
   MF.all[MF.all$Significant < MF.all$Expected,],
   caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0005515 protein binding 2101 66 78.96 1 5e-08 1.6e-10 1.6e-09
3 GO:0005509 calcium ion binding 361 11 13.57 4 0.00088 0.00088 0.00088
4 GO:0036459 thiol-dependent ubiquitinyl hydrolase ac… 43 1 1.62 27 0.01870 0.00230 0.01870
5 GO:0004812 aminoacyl-tRNA ligase activity 42 0 1.58 5 0.00240 0.00240 0.00240
8 GO:0008536 Ran GTPase binding 14 0 0.53 6 0.00318 0.00318 0.00318
9 GO:0004707 MAP kinase activity 6 0 0.23 7 0.00634 0.00634 0.00634
13 GO:0005524 ATP binding 786 19 29.54 10 0.00877 0.00877 0.00877
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(MF.weight01)[sigTerms]),
      caption = paste('Molecular function terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Molecular function terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0005515 protein binding Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules). 0.0000000
GO:0008417 fucosyltransferase activity Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0007586
GO:0005509 calcium ion binding Interacting selectively and non-covalently with calcium ions (Ca2+). 0.0008786
GO:0004812 aminoacyl-tRNA ligase activity Catalysis of the formation of aminoacyl-tRNA from ATP, amino acid, and tRNA with the release of diphosphate and AMP. 0.0024049
GO:0016757 transferase activity, transferring glycosyl groups Catalysis of the transfer of a glycosyl group from one compound (donor) to another (acceptor). 0.0025943
GO:0008536 Ran GTPase binding Interacting selectively and non-covalently with Ran, a conserved Ras-like GTP-binding protein, implicated in nucleocytoplasmic transport, cell cycle progression, spindle assembly, nuclear organization and nuclear envelope (NE) assembly. 0.0031825
GO:0004707 MAP kinase activity Catalysis of the reaction: protein + ATP = protein phosphate + ADP. This reaction is the phosphorylation of proteins. Mitogen-activated protein kinase; a family of protein kinases that perform a crucial step in relaying signals from the plasma membrane to the nucleus. They are activated by a wide range of proliferation- or differentiation-inducing signals; activation is strong with agonists such as polypeptide growth factors and tumor-promoting phorbol esters, but weak (in most cell backgrounds) by stress stimuli. 0.0063440
GO:0008061 chitin binding Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0075727
GO:0005524 ATP binding Interacting selectively and non-covalently with ATP, adenosine 5’-triphosphate, a universally important coenzyme and enzyme regulator. 0.0087728
GO:0008375 acetylglucosaminyltransferase activity Catalysis of the transfer of an N-acetylglucosaminyl residue from UDP-N-acetyl-glucosamine to a sugar. 0.0092534
showSigOfNodes(GOdata.MF, score(MF.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 61 
## Number of Edges = 78 
## 
## $complete.dag
## [1] "A graph with 61 nodes."
showGroupDensity(GOdata.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component

orderedTerms <- names(sort(score(CC.weight01)))
significant.weight01 <- score(CC.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(CC.lea)[orderedTerms] <= 0.01
significant.elim     <- score(CC.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.pvalue.sigTerms <- sigTerms

CC.all <- GenTable(GOdata.CC, elim=CC.elim, weight01=CC.weight01, lea=CC.lea,
                   orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
   CC.all[CC.all$Significant > CC.all$Expected,],
   caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
3 GO:0005815 microtubule organizing center 37 2 1.43 11 0.01340 0.0046 0.01340
4 GO:0005885 Arp2/3 protein complex 8 2 0.31 5 0.00799 0.0080 0.00799
kable(
   CC.all[CC.all$Significant < CC.all$Expected,],
   caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0030286 dynein complex 26 0 1.00 1 0.00024 0.0022 0.00024
2 GO:0032991 protein-containing complex 886 20 34.17 31 0.04084 0.0037 0.03349
5 GO:0005680 anaphase-promoting complex 7 0 0.27 7 0.01124 0.0112 0.01124
6 GO:0044428 nuclear part 236 4 9.10 39 0.05022 0.0122 0.07962
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(CC.weight01)[sigTerms]),
      caption = paste('Cellular component terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0030286 dynein complex Any of several large complexes that contain two or three dynein heavy chains and several light chains, and have microtubule motor activity. 0.0021741
GO:0005885 Arp2/3 protein complex A stable protein complex that contains two actin-related proteins, Arp2 and Arp3, and five novel proteins (ARPC1-5), and functions in the nucleation of branched actin filaments. 0.0079859
showSigOfNodes(GOdata.CC, score(CC.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 30 
## Number of Edges = 48 
## 
## $complete.dag
## [1] "A graph with 30 nodes."
showGroupDensity(GOdata.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Using the portion of variance explained by hatching

Building the topGO object

I need to generate the topGO objects again, using the alternative gene ordering, based on the proportion of expression-level variance explained by hatching. I miss a way to inform the topGOdata object that the score in this case is reversed, with respect to p-values: the higher it is, the more differentially expressed the gene is. To make sure that GO terms are tested in the same way than when using p-values, I will just reverse the proportion of variance explained by hatching to its complement. Taking this into account, the subset of interesting genes (selection() function) must be defined as the lowest 10% scores, which are the 10% genes with largest portion of variance explained by hatching.

selection <- function(allScores) {return(allScores <= quantile(allScores, probs = 0.10))}
GOdataVar.BP <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'BP', 
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdataVar.MF <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'MF',
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

GOdataVar.CC <- new('topGOdata',
   description = 'Differential expression among Brachionus plicatilis populations.',
   ontology = 'CC',
   allGenes = 1.0 - Variance,
   annot = annFUN.gene2GO,
   gene2GO = allgenes2GO,
   geneSelectionFun = selection,
   nodeSize = 5)

library(knitr)
DataSummary <- data.frame(ontology = c('BP', 'MF', 'CC'),
   Num_Genes = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), numGenes),
   Num_GO_terms = sapply(list(GOdataVar.BP, GOdataVar.MF, GOdataVar.CC), function(x) length(usedGO(x))))
kable(DataSummary, caption='Number of feasible genes or transcripts and number of GO terms used in each data set.')
rm(DataSummary)
Number of feasible genes or transcripts and number of GO terms used in each data set.
ontology Num_Genes Num_GO_terms
BP 15162 1154
MF 17429 576
CC 12940 291

Running the tests

BPvar.elim     <- runTest(GOdataVar.BP, algorithm = "elim",     statistic = "ks")
BPvar.weight01 <- runTest(GOdataVar.BP, algorithm = "weight01", statistic = "ks")
BPvar.lea      <- runTest(GOdataVar.BP, algorithm = "lea",      statistic = "ks")
MFvar.elim     <- runTest(GOdataVar.MF, algorithm = "elim",     statistic = "ks")
MFvar.weight01 <- runTest(GOdataVar.MF, algorithm = "weight01", statistic = "ks")
MFvar.lea      <- runTest(GOdataVar.MF, algorithm = "lea",      statistic = "ks")
CCvar.elim     <- runTest(GOdataVar.CC, algorithm = "elim",     statistic = "ks")
CCvar.weight01 <- runTest(GOdataVar.CC, algorithm = "weight01", statistic = "ks")
CCvar.lea      <- runTest(GOdataVar.CC, algorithm = "lea",      statistic = "ks")

ResultsSummary <- data.frame(ontology = rep(c("BP", "MF", "CC"), each = 3),
   algorithm = rep(c("elim", "weight01", "lea"), 3),
   TermsTested = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) length(score(X))),
   Significant = sapply(list(BPvar.elim, BPvar.weight01, BPvar.lea, MFvar.elim, MFvar.weight01, MFvar.lea, CCvar.elim, CCvar.weight01, CCvar.lea), function(X) sum(score(X) < 0.01)))

kable(ResultsSummary, caption="Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.")
Number of non-trivial terms tested and those with a score (not corrected p-value) lower than 0.01.
ontology algorithm TermsTested Significant
BP elim 1154 39
BP weight01 1154 18
BP lea 1154 74
MF elim 576 22
MF weight01 576 17
MF lea 576 29
CC elim 291 11
CC weight01 291 9
CC lea 291 22
rm(ResultsSummary)

Results

Biological process

orderedTerms <- names(sort(score(BPvar.weight01)))
significant.weight01 <- score(BPvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(BPvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(BPvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
BP.variance.sigTerms <- sigTerms

BPvar.all <- GenTable(GOdataVar.BP, elim=BPvar.elim, weight01=BPvar.weight01, lea=BPvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))

kable(
   BPvar.all[BPvar.all$Significant > BPvar.all$Expected,],
   caption = "Most over-represented terms of the Biological Process ontology.")
Most over-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0046907 intracellular transport 202 30 19.97 136 0.03824 0.00018 0.00241
3 GO:0007264 small GTPase mediated signal transductio… 81 12 8.01 3 0.00034 0.00075 0.00034
7 GO:0007020 microtubule nucleation 11 2 1.09 10 0.00224 0.00224 0.00224
8 GO:0015931 nucleobase-containing compound transport 29 3 2.87 211 0.08131 0.00225 0.08131
9 GO:1901264 carbohydrate derivative transport 11 2 1.09 150 0.04630 0.00225 0.04630
11 GO:0034314 Arp2/3 complex-mediated actin nucleation 12 4 1.19 17 0.00291 0.00291 0.00291
12 GO:0006417 regulation of translation 16 3 1.58 47 0.01119 0.00423 0.01119
14 GO:0048193 Golgi vesicle transport 43 6 4.25 428 0.25589 0.00506 0.36524
16 GO:0032006 regulation of TOR signaling 7 3 0.69 28 0.00672 0.00672 0.00672
17 GO:0006302 double-strand break repair 21 4 2.08 8 0.00189 0.00724 0.00189
18 GO:1902533 positive regulation of intracellular sig… 6 3 0.59 35 0.00967 0.00967 0.00967
20 GO:0006012 galactose metabolic process 6 2 0.59 46 0.01093 0.01093 0.01093
22 GO:0046854 phosphatidylinositol phosphorylation 11 2 1.09 62 0.01364 0.01364 0.01364
23 GO:0006672 ceramide metabolic process 8 3 0.79 336 0.16508 0.01517 0.16508
25 GO:0006357 regulation of transcription by RNA polym… 58 7 5.73 119 0.03000 0.01610 0.03000
26 GO:0006606 protein import into nucleus 9 2 0.89 84 0.01887 0.01887 0.01887
27 GO:0006325 chromatin organization 60 13 5.93 49 0.01146 0.01903 0.00282
28 GO:0007275 multicellular organism development 41 5 4.05 335 0.16442 0.01931 0.16442
29 GO:0051650 establishment of vesicle localization 6 2 0.59 90 0.01972 0.01972 0.01972
30 GO:0006506 GPI anchor biosynthetic process 24 5 2.37 92 0.02027 0.02027 0.02027
31 GO:0009967 positive regulation of signal transducti… 9 5 0.89 99 0.02283 0.02283 0.00026
32 GO:0016573 histone acetylation 12 5 1.19 101 0.02325 0.02325 0.02325
34 GO:0031323 regulation of cellular metabolic process 423 53 41.82 163 0.05053 0.02474 0.00081
35 GO:0009968 negative regulation of signal transducti… 9 3 0.89 18 0.00300 0.02493 0.00300
36 GO:0050953 sensory perception of light stimulus 8 2 0.79 666 0.49177 0.02495 0.49177
37 GO:0007156 homophilic cell adhesion via plasma memb… 21 3 2.08 106 0.02580 0.02580 0.02580
38 GO:0009894 regulation of catabolic process 5 2 0.49 112 0.02793 0.02793 0.02793
kable(
   BPvar.all[BPvar.all$Significant < BPvar.all$Expected,],
   caption = "Most under-represented terms of the Biological Process ontology.")
Most under-represented terms of the Biological Process ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
2 GO:0060271 cilium assembly 46 2 4.55 1 7e-05 0.00063 7e-05
4 GO:0006486 protein glycosylation 75 7 7.41 19 0.00314 0.00092 0.00314
5 GO:0006511 ubiquitin-dependent protein catabolic pr… 102 7 10.08 456 0.29323 0.00113 0.29323
6 GO:0006030 chitin metabolic process 74 4 7.32 55 0.01247 0.00163 0.01247
10 GO:0006413 translational initiation 18 0 1.78 12 0.00268 0.00268 0.00268
13 GO:0040029 regulation of gene expression, epigeneti… 6 0 0.59 23 0.00481 0.00481 0.00481
15 GO:0016310 phosphorylation 319 21 31.54 557 0.38238 0.00570 0.63116
19 GO:0016579 protein deubiquitination 40 3 3.95 44 0.01061 0.01061 0.01061
21 GO:0006418 tRNA aminoacylation for protein translat… 40 1 3.95 57 0.01302 0.01302 0.01302
24 GO:0044262 cellular carbohydrate metabolic process 28 1 2.77 875 0.70059 0.01608 0.70059
33 GO:0033365 protein localization to organelle 38 3 3.76 30 0.00726 0.02412 0.00726
39 GO:0006457 protein folding 49 3 4.84 113 0.02800 0.02800 0.02800
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(BPvar.weight01)[sigTerms]),
      caption = paste('Biological process terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Biological process terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0060271 cilium assembly The assembly of a cilium, a specialized eukaryotic organelle that consists of a filiform extrusion of the cell surface. Each cilium is bounded by an extrusion of the cytoplasmic membrane, and contains a regular longitudinal array of microtubules, anchored basally in a centriole. 0.0006298
GO:0007264 small GTPase mediated signal transduction Any series of molecular signals in which a small monomeric GTPase relays one or more of the signals. 0.0007542
GO:0006486 protein glycosylation A protein modification process that results in the addition of a carbohydrate or carbohydrate derivative unit to a protein amino acid, e.g. the addition of glycan chains to proteins. 0.0009226
GO:0007020 microtubule nucleation The process in which tubulin alpha-beta heterodimers begin aggregation to form an oligomeric tubulin structure (a microtubule seed). Microtubule nucleation is the initiating step in the formation of a microtubule in the absence of any existing microtubules (‘de novo’ microtubule formation). 0.0022416
GO:0006413 translational initiation The process preceding formation of the peptide bond between the first two amino acids of a protein. This includes the formation of a complex of the ribosome, mRNA or circRNA, and an initiation complex that contains the first aminoacyl-tRNA. 0.0026807
GO:0034314 Arp2/3 complex-mediated actin nucleation The actin nucleation process in which actin monomers combine to form a new branch on the side of an existing actin filament; mediated by the Arp2/3 protein complex and its interaction with other proteins. 0.0029146
GO:0040029 regulation of gene expression, epigenetic Any process that modulates the frequency, rate or extent of gene expression; the process is mitotically or meiotically heritable, or is stably self-propagated in the cytoplasm of a resting cell, and does not entail a change in DNA sequence. 0.0048076
GO:0032006 regulation of TOR signaling Any process that modulates the frequency, rate or extent of TOR signaling. 0.0067152
GO:0006302 double-strand break repair The repair of double-strand breaks in DNA via homologous and nonhomologous mechanisms to reform a continuous DNA helix. 0.0072352
GO:1902533 positive regulation of intracellular signal transduction NA 0.0096696
showSigOfNodes(GOdataVar.BP, score(BPvar.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 290 
## Number of Edges = 563 
## 
## $complete.dag
## [1] "A graph with 290 nodes."

Below I plot variance portion, but for the termo found most significant when using p-values, for comparison.

showGroupDensity(GOdataVar.BP, names(score(BP.weight01))[which.min(score(BP.weight01))])

Molecular function

orderedTerms <- names(sort(score(MFvar.weight01)))
significant.weight01 <- score(MFvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(MFvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(MFvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
MF.variance.sigTerms <- sigTerms

MFvar.all <- GenTable(GOdataVar.MF, elim=MFvar.elim, weight01=MFvar.weight01, lea=MFvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=sum(significant.elim))
kable(
   MFvar.all[MFvar.all$Significant > MFvar.all$Expected,],
   caption = "Most over-represented terms of the Molecular Function ontology.")
Most over-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0005515 protein binding 2101 235 207.34 1 3.9e-07 1.9e-08 4.4e-09
2 GO:0008536 Ran GTPase binding 14 3 1.38 2 4.7e-05 4.7e-05 4.7e-05
3 GO:0043015 gamma-tubulin binding 9 3 0.89 3 0.00012 0.00012 0.00012
7 GO:0008417 fucosyltransferase activity 33 4 3.26 8 0.00114 0.00114 0.00114
9 GO:0003714 transcription corepressor activity 10 5 0.99 12 0.00369 0.00369 0.00369
11 GO:0016757 transferase activity, transferring glyco… 194 21 19.15 66 0.03258 0.00403 1.1e-05
12 GO:0060090 molecular adaptor activity 10 1 0.99 122 0.12197 0.00527 0.12197
14 GO:0004386 helicase activity 46 8 4.54 44 0.01886 0.00713 0.01886
15 GO:0005096 GTPase activator activity 34 4 3.36 16 0.00735 0.00735 0.00735
16 GO:0004312 fatty acid synthase activity 5 3 0.49 19 0.00912 0.00912 0.00912
17 GO:0005102 signaling receptor binding 30 4 2.96 112 0.09745 0.00973 0.09745
18 GO:0004842 ubiquitin-protein transferase activity 76 13 7.50 65 0.03179 0.01003 0.03179
20 GO:0008194 UDP-glycosyltransferase activity 35 5 3.45 10 0.00203 0.01153 0.00203
21 GO:0003993 acid phosphatase activity 8 1 0.79 26 0.01198 0.01198 0.01198
kable(
   MFvar.all[MFvar.all$Significant < MFvar.all$Expected,],
   caption = "Most under-represented terms of the Molecular Function ontology.")
Most under-represented terms of the Molecular Function ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
4 GO:0003743 translation initiation factor activity 28 1 2.76 4 0.00015 0.00015 0.00015
5 GO:0008061 chitin binding 77 7 7.60 5 0.00026 0.00026 0.00026
6 GO:0042578 phosphoric ester hydrolase activity 140 9 13.82 495 0.86923 0.00082 0.86923
8 GO:1901505 carbohydrate derivative transmembrane tr… 22 2 2.17 264 0.37713 0.00221 0.37713
10 GO:0036459 thiol-dependent ubiquitinyl hydrolase ac… 43 3 4.24 51 0.02297 0.00395 0.02297
13 GO:0005524 ATP binding 786 66 77.57 15 0.00571 0.00571 0.00571
19 GO:0015932 nucleobase-containing compound transmemb… 23 2 2.27 288 0.44074 0.01062 0.44074
22 GO:0004707 MAP kinase activity 6 0 0.59 27 0.01238 0.01238 0.01238
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(MFvar.weight01)[sigTerms]),
      caption = paste('Molecular functions terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Molecular functions terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0005515 protein binding Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules). 0.0000000
GO:0008536 Ran GTPase binding Interacting selectively and non-covalently with Ran, a conserved Ras-like GTP-binding protein, implicated in nucleocytoplasmic transport, cell cycle progression, spindle assembly, nuclear organization and nuclear envelope (NE) assembly. 0.0000471
GO:0043015 gamma-tubulin binding Interacting selectively and non-covalently with the microtubule constituent protein gamma-tubulin. 0.0001188
GO:0003743 translation initiation factor activity Functions in the initiation of ribosome-mediated translation of mRNA into a polypeptide. 0.0001548
GO:0008061 chitin binding Interacting selectively and non-covalently with chitin, a linear polysaccharide consisting of beta-(1->4)-linked N-acetyl-D-glucosamine residues. 0.0002588
GO:0008417 fucosyltransferase activity Catalysis of the transfer of a fucosyl group to an acceptor molecule, typically another carbohydrate or a lipid. 0.0011400
GO:0003714 transcription corepressor activity A protein or a member of a complex that interacts specifically and non-covalently with a DNA-bound DNA-binding transcription factor to repress the transcription of specific genes. Corepressors often act by altering chromatin structure and modifications. For example, one class of transcription coregulators modifies chromatin structure through covalent modification of histones. A second ATP-dependent class modifies the conformation of chromatin. A third class occludes DNA-binding transcription factor protein-protein interaction domains. A third class of corepressors prevents interactions of DNA bound DNA-binding transcription factor with coactivators. 0.0036872
GO:0005524 ATP binding Interacting selectively and non-covalently with ATP, adenosine 5’-triphosphate, a universally important coenzyme and enzyme regulator. 0.0057069
GO:0005096 GTPase activator activity Binds to and increases the activity of a GTPase, an enzyme that catalyzes the hydrolysis of GTP. 0.0073531
GO:0004312 fatty acid synthase activity Catalysis of the reaction: acetyl-CoA + n malonyl-CoA + 2n NADPH + 2n H+ = long-chain fatty acid + n+1 CoA + n CO2 + 2n NADP+. 0.0091154
showSigOfNodes(GOdataVar.MF, score(MFvar.weight01),
               firstSigNodes = sum(significant.elim),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 81 
## Number of Edges = 97 
## 
## $complete.dag
## [1] "A graph with 81 nodes."

I plot variance portion, but for the term found most significant when using p-values, for comparison.

showGroupDensity(GOdataVar.MF, names(score(MF.weight01))[which.min(score(MF.weight01))])

Cellular component

orderedTerms <- names(sort(score(CCvar.weight01)))
significant.weight01 <- score(CCvar.weight01)[orderedTerms] <= 0.01
significant.lea      <- score(CCvar.lea)[orderedTerms] <= 0.01
significant.elim     <- score(CCvar.elim)[orderedTerms] <= 0.01
sigTerms <- orderedTerms[significant.weight01 & significant.lea & significant.elim]
if (length(sigTerms) == 0) {
   sigTerms <- orderedTerms[significant.lea & significant.elim]
}
# sigTerms gets updated, and is already used in the code, but I need a permanent copy for later.
CC.variance.sigTerms <- sigTerms

CCvar.all <- GenTable(GOdataVar.CC, elim=CCvar.elim, weight01=CCvar.weight01, lea=CCvar.lea,
                      orderBy="weight01", ranksOf="elim", topNodes=max(sum(significant.elim), 10))

kable(
   CCvar.all[CCvar.all$Significant > CCvar.all$Expected,],
   caption = "Most over-represented terms of the Cellular Component ontology.")
Most over-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
3 GO:0000790 nuclear chromatin 24 3 2.34 41 0.03543 0.00231 0.03543
4 GO:0044428 nuclear part 236 34 23.02 7 0.00424 0.00354 3.4e-05
6 GO:0005680 anaphase-promoting complex 7 4 0.68 9 0.00648 0.00648 0.00648
8 GO:0030015 CCR4-NOT core complex 8 4 0.78 11 0.00806 0.00806 0.00806
9 GO:1902562 H4 histone acetyltransferase complex 10 2 0.98 35 0.02728 0.00821 0.02728
11 GO:0044451 nucleoplasm part 96 14 9.36 6 0.00349 0.01954 0.00349
kable(
   CCvar.all[CCvar.all$Significant < CCvar.all$Expected,],
   caption = "Most under-represented terms of the Cellular Component ontology.")
Most under-represented terms of the Cellular Component ontology.
GO.ID Term Annotated Significant Expected Rank in elim elim weight01 lea
1 GO:0005852 eukaryotic translation initiation factor… 13 0 1.27 1 0.00031 0.00031 0.00031
2 GO:0032991 protein-containing complex 886 84 86.41 12 0.01080 0.00229 8.5e-05
5 GO:0016272 prefoldin complex 7 0 0.68 8 0.00484 0.00484 0.00484
7 GO:0005783 endoplasmic reticulum 68 4 6.63 128 0.29861 0.00773 0.29861
10 GO:0005737 cytoplasm 634 53 61.83 42 0.03675 0.01317 0.02464
vennDiagram(vennCounts(cbind(weight01=significant.weight01,
                             lea=significant.lea,
                             elim=significant.elim)))

kable(data.frame(Term=Term(GOTERM[sigTerms]),
                 Definition=Definition(GOTERM[sigTerms]),
                 PValue=score(CCvar.weight01)[sigTerms]),
      caption = paste('Cellular component terms significantly associated with', VAR,
                      'according to all 3 algorithms', sep=' '))
Cellular component terms significantly associated with hatching according to all 3 algorithms
Term Definition PValue
GO:0005852 eukaryotic translation initiation factor 3 complex A complex of several polypeptides that plays at least two important roles in protein synthesis: First, eIF3 binds to the 40S ribosome and facilitates loading of the Met-tRNA/eIF2.GTP ternary complex to form the 43S preinitiation complex. Subsequently, eIF3 apparently assists eIF4 in recruiting mRNAs to the 43S complex. The eIF3 complex contains five conserved core subunits, and may contain several additional proteins; the non-core subunits are thought to mediate association of the complex with specific sets of mRNAs. 0.0003090
GO:0044428 nuclear part Any constituent part of the nucleus, a membrane-bounded organelle of eukaryotic cells in which chromosomes are housed and replicated. 0.0035361
GO:0016272 prefoldin complex A multisubunit chaperone that is capable of delivering unfolded proteins to cytosolic chaperonin, which it acts as a cofactor for. In humans, the complex is a heterohexamer of two PFD-alpha and four PFD-beta type subunits. In Saccharomyces cerevisiae, it also acts in the nucleus to regulate the rate of elongation by RNA polymerase II via a direct effect on histone dynamics. 0.0048382
GO:0005680 anaphase-promoting complex A ubiquitin ligase complex that degrades mitotic cyclins and anaphase inhibitory protein, thereby triggering sister chromatid separation and exit from mitosis. Substrate recognition by APC occurs through degradation signals, the most common of which is termed the Dbox degradation motif, originally discovered in cyclin B. 0.0064777
GO:0030015 CCR4-NOT core complex The core of the CCR4-NOT complex. In Saccharomyces the CCR4-NOT core complex comprises Ccr4p, Caf1p, Caf40p, Caf130p, Not1p, Not2p, Not3p, Not4p, and Not5p. 0.0080568
showSigOfNodes(GOdataVar.CC, score(CCvar.weight01),
               firstSigNodes = sum(significant.weight01),
               wantedNodes = sigTerms)

## $dag
## A graphNEL graph with directed edges
## Number of Nodes = 46 
## Number of Edges = 78 
## 
## $complete.dag
## [1] "A graph with 46 nodes."

For comparison, I plot the distribution of variance portion for the CC term found most significant when using p-values.

showGroupDensity(GOdataVar.CC, names(score(CC.weight01))[which.min(score(CC.weight01))])

Comparison between the two ordering of genes.

Biological process

allTerms <- usedGO(GOdata.BP)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% BP.pvalue.sigTerms,
                             Variance = allTerms %in% BP.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(BP.weight01))[allTerms],
                         Variance = rank(score(BPvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of BP terms by significance')

Molecular function

allTerms <- usedGO(GOdata.MF)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% MF.pvalue.sigTerms,
                             Variance = allTerms %in% MF.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(MF.weight01))[allTerms],
                         Variance = rank(score(MFvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of MF terms by significance')

Cellular component

allTerms <- usedGO(GOdata.CC)
vennDiagram(vennCounts(cbind(PValue   = allTerms %in% CC.pvalue.sigTerms,
                             Variance = allTerms %in% CC.variance.sigTerms)))

ggplot(data = data.frame(PValue   = rank(score(CC.weight01))[allTerms],
                         Variance = rank(score(CCvar.weight01))[allTerms]),
       mapping = aes(x = PValue, y = Variance)) +
   geom_point() + geom_smooth() + xlab('Using gene p-values') +
   ylab('Using portion of explained variance') +
   ggtitle('Ordering of CC terms by significance')

Session info

I save the main variables to be able to load them in a new R session later.

save(allgenes2GO,
     GOdata.BP, BP.elim, BP.weight01, BP.lea, BP.pvalue.sigTerms,
     GOdata.MF, MF.elim, MF.weight01, MF.lea, MF.pvalue.sigTerms,
     GOdata.CC, CC.elim, CC.weight01, CC.lea, CC.pvalue.sigTerms,
     GOdataVar.BP, BPvar.elim, BPvar.weight01, BPvar.lea, BP.variance.sigTerms,
     GOdataVar.MF, MFvar.elim, MFvar.weight01, MFvar.lea, MF.variance.sigTerms,
     GOdataVar.CC, CCvar.elim, CCvar.weight01, CCvar.lea, CC.variance.sigTerms,
     file = paste('Enrichment', TAG, VAR, 'RData', sep='.'))
sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=ca_ES.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=ca_ES.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=ca_ES.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] Rgraphviz_2.30.0     ggplot2_3.2.1        limma_3.42.0        
##  [4] knitr_1.27           topGO_2.38.1         SparseM_1.78        
##  [7] GO.db_3.10.0         AnnotationDbi_1.48.0 IRanges_2.20.2      
## [10] S4Vectors_0.24.3     Biobase_2.46.0       graph_1.64.0        
## [13] BiocGenerics_0.32.0 
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5   xfun_0.12          purrr_0.3.3        splines_3.6.2     
##  [5] lattice_0.20-38    colorspace_1.4-1   vctrs_0.2.1        htmltools_0.4.0   
##  [9] yaml_2.2.0         mgcv_1.8-31        blob_1.2.1         rlang_0.4.2       
## [13] pillar_1.4.3       glue_1.3.1         withr_2.1.2        DBI_1.1.0         
## [17] bit64_0.9-7        matrixStats_0.55.0 lifecycle_0.1.0    stringr_1.4.0     
## [21] munsell_0.5.0      gtable_0.3.0       memoise_1.1.0      evaluate_0.14     
## [25] labeling_0.3       highr_0.8          Rcpp_1.0.3         backports_1.1.5   
## [29] scales_1.1.0       farver_2.0.3       bit_1.1-15.1       digest_0.6.23     
## [33] stringi_1.4.5      dplyr_0.8.3        tools_3.6.2        magrittr_1.5      
## [37] lazyeval_0.2.2     tibble_2.1.3       RSQLite_2.2.0      crayon_1.3.4      
## [41] pkgconfig_2.0.3    zeallot_0.1.0      Matrix_1.2-18      assertthat_0.2.1  
## [45] rmarkdown_2.1      R6_2.4.1           nlme_3.1-143       compiler_3.6.2